This is the first project of the Edwisor Data Science career path.
This project is a supervised learning regression problem. The goal is to predict the number of bikes that will be rented, using the past two years of data.
Several regression machine learning algorithms are tested to predict the number of bikes, and mean absolute error, mean squared error and R squared are calculated to check the performance of these algorithms.
First, the behavioural pattern of customers is studied from the given data, which contains factors such as season, year, month, weather condition, temperature and wind speed. Then the factors that play an important role in the number of bikes rented are selected and fed to the algorithms.
This kind of analysis lets us estimate future demand, so stock can be prepared according to the predicted numbers, which in turn increases the company's revenue.
Here the required libraries and the data are imported.
We have 731 observations, covering 2 years, and 15 features: dteday, season, yr, mnth, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered and cnt.
The target variable is 'cnt', which is the total number of bikes rented.
import warnings
warnings.filterwarnings("ignore")
#Importing required libraries
import numpy as np
import pandas as pd
#Importing libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
#Reading dataset
bikeRent = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/edwisor-india-bucket/projects/data/DataN0103/day.csv",
index_col=0)
#Get Dimensions of dataset
bikeRent.shape
#Get first 5 rows
bikeRent.head()
#Statistical analysis of data
bikeRent.describe()
#Get names of column of dataset
bikeRent.columns
Before doing the analysis we first need to identify the categorical features and replace their numerical codes with the corresponding human-readable values.
This is done for a better understanding of the features during the analysis.
#Create new dataset for Exploratory Data Analysis
data = bikeRent.copy()
#Changing numeric codes to categorical labels and renaming columns to descriptive names
#data['Index'] = data['instant']
data['Date'] = data['dteday'].astype('category')
data['Season'] = data['season'].replace([1,2,3,4],['Spring','Summer','Fall','Winter']).astype('category')
data['Year'] = data['yr'].replace([0,1],['2011','2012']).astype('category')
data['Month'] = data['mnth'].astype('category')
#holiday: 0 = not a holiday, 1 = holiday
data['Holiday'] = data['holiday'].replace([0,1],['No holiday','Holiday']).astype('category')
data['Weekday'] = data['weekday'].astype('category')
#workingday: 1 = neither weekend nor holiday, 0 = otherwise
data['Working Day'] = data['workingday'].replace([0,1],['Holiday','Working day']).astype('category')
data['Weather Condition'] = data['weathersit'].replace([1,2,3,4],['Clear, Few clouds, Partly cloudy, Partly cloudy',
'Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist',
'Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds',
'Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog']).astype('category')
#temp and atemp are min-max normalised; rescale them back to degrees Celsius
#temp was normalised with min -8 and max +39, atemp with min -16 and max +50
data['Temperature'] = (data['temp']*(39 + 8)) - 8
data['Feeling Temperature'] = (data['atemp']*(50 + 16)) - 16
data['Humidity'] = data['hum'] * 100        #hum was divided by 100
data['Wind Speed'] = data['windspeed'] * 67 #windspeed was divided by 67
data['Casual Users'] = data['casual']
data['Registered Users'] = data['registered']
data['Count'] = data['cnt']
#Drop the original encoded columns, keeping only the relabelled ones
data = data.drop(columns = bikeRent.columns)
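As a quick, purely illustrative sanity check of the relabelling (not part of the original workflow), the new categorical columns can be inspected:
#Illustrative check of the relabelled dataset (uses only the columns created above)
print(data.dtypes)
print(data['Season'].value_counts())
print(data['Weather Condition'].value_counts())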
Here the various categorical features are analysed and visualised with respect to the target feature.
sns.histplot(data['Count'], kde=True)   #histplot replaces the deprecated sns.distplot
#Probability distribution of the target variable 'Count' is close to normally distributed
#Separating columns by category
Categorical = ['Date','Season','Year','Month','Holiday','Weekday','Working Day','Weather Condition']
Continuous = ['Temperature','Feeling Temperature','Humidity','Wind Speed','Casual Users','Registered Users','Count']
#distribution of categorical variables with target variable
print(data.groupby('Season')['Count'].sum())
#Number of bikes hired season wise
print()
plt.gcf().set_size_inches(12,8)
sns.barplot(data=data,x='Season',y='Count',hue = 'Weather Condition')
Fall is the season in which most of the bikes were hired, followed by summer.
Most bikes were hired when the weather was clear or partly cloudy.
No bikes were hired in the heavy rain and ice pellets weather condition.
plt.gcf().set_size_inches(12,6)
sns.barplot(data=data,x='Year',y='Count')
#Year wise, most of the bikes were hired in 2012, possibly because of the growing popularity of the company
plt.gcf().set_size_inches(12,6)
sns.barplot(data=data,x='Month',y='Count')
#Most of the bikes were hired in June, July, August and September, which are the summer and fall months
plt.gcf().set_size_inches(12,6)
sns.barplot(data=data,x='Weekday',y='Count',hue = 'Holiday')
#Most of the bikes were hired on day 3 of the week
#Relation of continuous variable with target variable
sns.pairplot(data = data,
x_vars = ['Temperature','Feeling Temperature','Humidity','Wind Speed','Casual Users','Registered Users'],
y_vars = ['Count'])
#Distribution of continuous variable
data[Continuous].hist(bins = 50,figsize = (15,10))
Here missing value analysis is done. As per the analysis there are no missing values in the data.
#Missing value analysis
data.isnull().sum()
Outliers are values that lie far from the other observations; they can distort the mean and standard deviation and introduce high variability. These values should be handled either by removing them or by replacing them with suitable values such as the mean, median or mode.
The box plot method is used to detect the outliers in the data.
Outliers are detected in Humidity, Wind Speed and Casual Users.
#Detection of outliers
var1 = ['Temperature','Feeling Temperature','Humidity','Wind Speed']
plt.gcf().set_size_inches(10,5)
sns.boxplot(data = data[var1])
#from boxplot the outliers can be seen in Humidity and Wind speed
var2 = ['Casual Users','Registered Users']
plt.gcf().set_size_inches(10,5)
sns.boxplot(data = data[var2])
#here the outliers can be seen in Casual users
The outliers are removed using the quartile (IQR) method: values beyond 1.5 times the inter-quartile range from the quartiles are dropped.
#Outlier removal using the IQR method (values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR are dropped)
for col in ['Humidity','Wind Speed','Casual Users']:
    q75, q25 = np.percentile(data[col], [75 ,25])
    iqr = q75 - q25
    lower = q25 - (iqr*1.5)
    upper = q75 + (iqr*1.5)
    data = data.drop(data[data[col] < lower].index)
    data = data.drop(data[data[col] > upper].index)
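As a quick illustrative check, the number of observations remaining after outlier removal can be inspected:
#Rows remaining after dropping outliers (the dataset started with 731 observations)
print(data.shape)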
Here the correlation between all the continuous variables, including the target variable, is checked using a heatmap.
From the heatmap it is concluded that Temperature and Feeling Temperature are highly correlated with each other.
Registered Users is also highly correlated with the target variable; since casual and registered users sum to the count, keeping them would leak the target into the model.
Date and Casual Users are likewise not required to carry forward to the machine learning algorithms.
So four features are dropped: Date, Feeling Temperature, Registered Users and Casual Users.
data[Continuous].corr().style.background_gradient(cmap='coolwarm')
CorrMat = data[Continuous].corr()
plt.gcf().set_size_inches(10,8)
sns.heatmap(CorrMat,annot =True)
sns.pairplot(data[Continuous])
#From the above plots it can be seen that:
#Temperature and Feeling Temperature are highly correlated
#Registered Users is also highly correlated with the target variable Count
#So Feeling Temperature and Registered Users are dropped from the dataset
#After experimenting, Date and Casual Users are also dropped to prevent the model from overfitting
data.columns
data = data.drop(columns=['Date','Feeling Temperature','Registered Users','Casual Users'],axis=1)
data['Season'] = bikeRent['season'].astype('category')
data['Year'] = bikeRent['yr'].astype('category')
data['Month'] = bikeRent['mnth'].astype('category')
data['Holiday'] = bikeRent['holiday'].astype('category')
data['Weekday'] = bikeRent['weekday'].astype('category')
data['Working Day'] = bikeRent['workingday'].astype('category')
data['Weather Condition'] = bikeRent['weathersit'].astype('category')
#data = pd.get_dummies(data)
#Using the original numeric codes as categorical features for the modelling input
data.head(2)
Dividing the dataset into 80% training data and 20% test data.
from sklearn.model_selection import train_test_split
xTrain,xTest,yTrain,yTest = train_test_split(data.loc[:,data.columns != 'Count'],data['Count'] ,test_size = 0.2)
#Splitting the dataset into train and test sets with a 20 percent test ratio
xTrain.head(2)
Here different models are developed to make the predictions, and different evaluation metrics are used to compare their performance. This helps to build confidence in the predictions.
The models are validated using K-fold cross validation, and the scores are compared to find the best performing model.
The data is tested with five regression algorithms: Linear Regression, Decision Tree, Random Forest, K-Nearest Neighbours and Lasso.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Lasso
models = []
models.append(('LR ',LinearRegression()))
models.append(('DTR',DecisionTreeRegressor()))
models.append(('RFR',RandomForestRegressor()))
models.append(('KNN',KNeighborsRegressor()))
models.append(('LSO',Lasso()))
#Using 5 regression algorithms to check which algorithm is best for this dataset
from sklearn.model_selection import KFold,cross_val_score
def ModellingAndEvaluationWithCrossValidation(models,scoring):
    for name,model in models:
        kfold = KFold(n_splits=10)   #10-fold cross validation (no shuffling)
        Scores = cross_val_score(model,xTrain,yTrain,scoring=scoring, cv=kfold)
        #sklearn returns negated error scores, so they are negated back and the square root is taken;
        #the printed cross-validation values are therefore on a square-root scale
        #(RMSE for mean squared error, sqrt(MAE) for mean absolute error)
        Scores = np.sqrt(-Scores)
        print(name ,' : ' ,Scores.mean())

def ModellingAndEvaluation(models,scoring):
    for name,model in models:
        model.fit(xTrain,yTrain)
        predict = model.predict(xTest)
        print(name ,' : ' ,scoring(yTest,predict))
#Created functions to evaluate the models with cross validation and on the test set
All five models are evaluated using these performance metrics. Mean absolute error and mean squared error are evaluated using K-fold cross validation, and the obtained scores are compared to assess the performance of the models. Note that the cross-validation helper prints the square root of each score, so the cross-validation numbers below are on a square-root scale and are not directly comparable with the test-set numbers.
Mean absolute error scores with Cross validation:-
LR : 24.756516372481055
DTR : 24.42818446609717
RFR : 21.72871879994421
KNN : 31.88993961677325
LSO : 24.74685025048984
Mean absolute error scores with test set:-
LR : 673.0230994298247
DTR : 667.4705882352941
RFR : 515.1220588235293
KNN : 1093.6750000000002
LSO : 671.4968494943123
from sklearn.metrics import mean_absolute_error
print("MAE scores with Cross validation:-")
ModellingAndEvaluationWithCrossValidation(models,'neg_mean_absolute_error')
print()
print("MAE scores with test set:-")
ModellingAndEvaluation(models,mean_absolute_error)
#Mean absolute error scores
#Here we can see that Random Forest has the lowest error, while KNN has the highest
Mean squared error scores with Cross validation:-
LR : 818.0101656656103
DTR : 867.841221570945
RFR : 667.7082018846622
KNN : 1218.8729421997991
LSO : 818.1679241703506
Mean squared error scores with test set:-
LR : 778193.4448160154
DTR : 875236.8014705882
RFR : 557750.5021323529
KNN : 1745766.0673529413
LSO : 775718.2731361163
from sklearn.metrics import mean_squared_error
print("MSE scores with Cross validation:-")
ModellingAndEvaluationWithCrossValidation(models,'neg_mean_squared_error')
print()
print("MSE scores with test set:-")
ModellingAndEvaluation(models,mean_squared_error)
#Mean squared error scores
#Tried and tested with different ratios of train and test sets
#From the above scores we can see that the Random Forest regressor gives the lowest errors for the bike rental count prediction
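For reference, a minimal sketch of what the two error metrics compute, equivalent to sklearn's mean_absolute_error and mean_squared_error (the helper names mae and mse are illustrative and not used elsewhere in this notebook):
#Illustrative definitions of the two error metrics used above
def mae(actual, predicted):
    #Mean Absolute Error: average absolute deviation, in the same units as 'Count'
    return np.mean(np.abs(actual - predicted))
def mse(actual, predicted):
    #Mean Squared Error: average squared deviation, penalises large errors more strongly
    return np.mean((actual - predicted) ** 2)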
MAPE scores with test set:-
LR : 22.059973855505874
DTR : 23.094885769820984
RFR : 20.002969760733535
KNN : 38.408839925256736
LSO : 21.95474677633306
def MAPE(actual,predicted):
    #Mean Absolute Percentage Error, in percent
    return np.mean((abs(actual-predicted))/actual)*100
print("MAPE scores with test set:-")
ModellingAndEvaluation(models,MAPE)
R squared scores with test set:-
LR : 0.830750294699801
DTR : 0.7753703908767058
RFR : 0.9041022858422911
KNN : 0.5952925705228493
LSO : 0.8308547664967059
from sklearn.metrics import r2_score
print("R squared scores with test set:-")
ModellingAndEvaluation(models,r2_score)
MAE = 447.4333823529412
MSE = 392496.15887647064
MAPE= 16.850845290622114
R2 = 0.9017552147082286
From the above training and evaluation it is concluded that the Random Forest regressor is the best fit for this dataset; the values above are the results for the different evaluation metrics used. The R squared value is quite high, which indicates that the model successfully explains approximately 90 percent of the variance in the data.
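For reference, R squared compares the model's squared prediction error against that of a baseline that always predicts the mean. A minimal illustrative sketch (the helper name r_squared is an assumption; it is equivalent in spirit to sklearn's r2_score):
#Illustrative definition of the R squared score (works on numpy arrays or pandas Series)
def r_squared(actual, predicted):
    ss_res = np.sum((actual - predicted) ** 2)          #residual sum of squares
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)    #total sum of squares around the mean
    return 1 - ss_res / ss_tot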
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
RFR = RandomForestRegressor()
RFR.fit(xTrain,yTrain)
predict = RFR.predict(xTest)
MAE = mean_absolute_error(yTest,predict)
MSE = mean_squared_error(yTest,predict)
MAPEScore = MAPE(yTest,predict)   #stored under a new name so the MAPE function is not overwritten
R2 = r2_score(yTest,predict)
print('MAE = ',MAE)
print('MSE = ',MSE)
print('MAPE= ',MAPEScore)
print('R2 = ',R2)
#Applied Random Forest regression to the data and calculated the MAE, MSE, MAPE and R squared values
#Compare actual and predicted values on the test set
output = pd.DataFrame({'Actual value':yTest,'Predicted value':predict})
#Save the trained model to disk
import pickle
pickle.dump(RFR, open('model.pkl','wb'))
#Download the saved model when running in Google Colab
from google.colab import files
files.download('model.pkl')
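To reuse the saved model later, it can be loaded back with pickle and applied to data with the same columns as xTrain. A minimal illustrative sketch ('model.pkl' is the file written above; loadedModel is a hypothetical name):
import pickle
#Load the trained Random Forest model back from disk
with open('model.pkl','rb') as f:
    loadedModel = pickle.load(f)
#Sanity check: predict on the existing test features
print(loadedModel.predict(xTest)[:5])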